Methodology¶
Note¶
The model's out-of-sample performance on data from a different time period is moderate, largely because of the effects of COVID-19 on the 2020 dataset. While the ROC AUC on the 2015 test set is 83%, it decreases to 74% on the 2020 validation set, a drop of roughly 9 percentage points. If different time periods are used for the validation set, the out-of-sample ROC AUC improves to around 78%; however, those results are not included in this report.
Data Sources¶
1. Freddie Mac Single Family Loan-Level Dataset¶
- Usage:
- 2015 dataset for training and testing
- 2020 dataset for validation
- Source: Freddie Mac Single Family Loan-Level Dataset
2. Freddie Mac Mortgage Rates¶
- Usage: Monthly mean mortgage rates for 15-year and 30-year fixed mortgages
- Source: Freddie Mac Primary Mortgage Market Survey (PMMS)
3. House Price Index (HPI)¶
4. Unemployment Rate (State & National Levels)¶
5. Consumer Price Index (CPI) (Removed)¶
- Note: The CPI data was considered but ultimately removed from the analysis.
- Source: U.S. Bureau of Labor Statistics (BLS) CPI Data
Files¶
- mortgage_feature_analysis.ipynb: Contains the data analysis and feature engineering processes.
- model.py: Includes model definitions, hyperparameter tuning, and training functions.
- helper_functions.py: Contains utility functions for data cleaning, processing, and visualization.
Feature Engineering¶
Monthly Mean Rates: Obtained from Freddie Mac Mortgage Rates for 15-year and 30-year fixed mortgages.
Interest Rate Difference (%): $$ \text{Interest Rate Diff (\%)} = \frac{\text{Original Interest Rate} - \text{Monthly Mean}}{\text{Monthly Mean}} \times 100 $$
Spread at Origination (SATO): $$ \text{SATO} = \text{Original Interest Rate} - \text{U.S. 30-Year FRM (Monthly Mean)} $$
Z-score at Origination (ZATO): Since Freddie Mac does not publish a monthly standard deviation, it was calculated from the 2015 and 2020 datasets: $$ \text{ZATO} = \frac{\text{Original Interest Rate} - \text{Mean Original Interest Rate}}{\text{Standard Deviation of Original Interest Rate}} $$
index_sa_state_mom12: The 12-month percentage change (momentum) in the state-level seasonally adjusted house price index. A minimal code sketch of the engineered rate features follows this list.
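The actual feature construction is handled by helper functions such as hp.process_yearly_data; the snippet below is only a minimal sketch of the three rate-based features, assuming an origination frame that already carries an original_interest_rate column and a merged monthly_mean_rate column (both names are illustrative rather than the exact ones used in helper_functions.py).
import pandas as pd

def add_rate_features(df: pd.DataFrame) -> pd.DataFrame:
    # Sketch only: assumes `original_interest_rate` (Freddie Mac origination field)
    # and `monthly_mean_rate` (matched PMMS monthly mean) are already on the frame.
    out = df.copy()
    # Interest Rate Difference (%): relative gap between the note rate and the monthly mean
    out['interest_diff_percentage'] = (
        (out['original_interest_rate'] - out['monthly_mean_rate'])
        / out['monthly_mean_rate'] * 100
    )
    # SATO: absolute spread over the monthly mean of the matching FRM product
    out['sato_f30'] = out['original_interest_rate'] - out['monthly_mean_rate']
    # ZATO: z-score of the note rate, using the mean/std observed in the loaded vintage
    mean_rate = out['original_interest_rate'].mean()
    std_rate = out['original_interest_rate'].std()
    out['zato'] = (out['original_interest_rate'] - mean_rate) / std_rate
    return out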
Final Feature Set Used¶
Categorical Features:
- first_time_homebuyer_flag
- occupancy_status
- loan_purpose
- property_state
Numerical Features:
- borrowers_times_credit_score
- sato_f30
- zato
- credit_score
- original_debt_to_income_ratio
- original_upb
- original_loan_term
- original_loan_to_value
- interest_diff_percentage
- number_of_borrowers
- index_sa_state_mom12
- State Unemployment Rate
- credit_score_times_debt_to_income_ratio
- credit_score_times_loan_to_value
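The three interaction terms at the end of this list are simple products of raw origination columns. Their exact construction lives in helper_functions.py; the lines below are only a hedged sketch using the Freddie Mac column names that appear elsewhere in this notebook.
# Illustrative only: the notebook builds these inside helper_functions.py.
df_orig_2015['borrowers_times_credit_score'] = (
    df_orig_2015['number_of_borrowers'] * df_orig_2015['credit_score']
)
df_orig_2015['credit_score_times_debt_to_income_ratio'] = (
    df_orig_2015['credit_score'] * df_orig_2015['original_debt_to_income_ratio']
)
df_orig_2015['credit_score_times_loan_to_value'] = (
    df_orig_2015['credit_score'] * df_orig_2015['original_loan_to_value']
)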
Model¶
Algorithms Used¶
- Logistic Regression
- Random Forest
- XGBoost
Data Splitting¶
- 2015 Dataset:
- Split into training (75%) and testing (25%) sets.
- 2020 Dataset:
- Used as the validation set.
Evaluation Metric¶
- ROC AUC: Used to evaluate model performance.
Training and Tuning Process¶
- Training: Models were trained using the training set.
- Hyperparameter Tuning:
- Performed using Optuna, a Bayesian hyperparameter optimization framework (a minimal tuning sketch follows this list).
- The test set was used to tune the models.
- Validation: The validation set was used for final performance evaluation.
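The tuning loop itself lives in model.py and is not shown in this notebook. As a rough illustration, an Optuna study that maximizes ROC AUC for the logistic regression model could look like the sketch below; the synthetic data and search space are assumptions, not the ones actually used.
import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in imbalanced data so the sketch runs on its own; the notebook uses the
# preprocessed 2015 training/testing matrices instead.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.96], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

def objective(trial):
    # Illustrative search space; model.py defines the actual one.
    c = trial.suggest_float('C', 1e-4, 10.0, log=True)
    clf = LogisticRegression(C=c, penalty='l2', solver='newton-cholesky', max_iter=1000)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)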
Feature Importance¶
Two methods were employed to assess feature importance:
- Permutation Importance:
- Randomly permute a feature and measure the resulting change in model performance (a minimal sketch follows this list).
- Tree-Based Impurity Importance:
- Applied to tree-based models like Random Forest and XGBoost when available.
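The grouped permutation importance used later (hp.compute_group_permutation_importance) is implemented in helper_functions.py. A minimal, ungrouped sketch of the same idea with scikit-learn's built-in helper, assuming a fitted estimator and the validation matrices from the Model section, looks like this:
import pandas as pd
from sklearn.inspection import permutation_importance

# Sketch only: `fitted_model`, `X_val_preprocessed`, `y_val`, and `feature_names`
# are assumed to come from the Model section of this notebook.
result = permutation_importance(
    fitted_model,
    X_val_preprocessed,
    y_val,
    scoring='roc_auc',
    n_repeats=5,
    random_state=0,
)
importance = pd.Series(result.importances_mean, index=feature_names).sort_values(ascending=False)
print(importance.head(10))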
Results¶
Based on both permutation importance and tree-based importance methods, the feature importance scores are as follows:
- SATO has a slightly higher importance than ZATO.
- Both SATO and ZATO are significantly more important than Interest Difference in Percentage (IDP).
Loading¶
Click the Code button on the right to see the underlying code.
You can also switch between tabsets, e.g. the Histogram and Binning sections.
%load_ext pretty_jupyter
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import psycopg2
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import helper.helper_functions as hp
import model as m
# jupyter nbconvert --to html --template pj mortage_feature_analysis.ipynb
# Macro data
df_hpi = pd.read_csv('Data/hpi_po_state.csv', parse_dates=['date'])
df_hpi['index_sa_state_mom12'] = df_hpi.groupby('state')['index_sa'].pct_change(12)
df_cpi = pd.read_csv('Data/CPIAUCSL.csv', parse_dates=['DATE'])
df_unemp_state = pd.read_csv('Data/unemp_rate.csv')
df_unemp_state = pd.melt(df_unemp_state, id_vars=['State'], var_name='Date', value_name='Unemployment Rate')
df_unemp_state['Date'] = pd.to_datetime(df_unemp_state['Date'], format='%b-%y')
df_unemp_state['State Abbreviation'] = df_unemp_state['State'].map(hp.state_abbreviations)
df_unemp_national = pd.read_csv('Data/UNRATE.csv', parse_dates=['DATE'])
# Freddie Mac weekly mortgage rates
df_interest_rate = pd.read_excel('Data/Freddie_interest_rate.xlsx', parse_dates=['Week'])
# Performance
with psycopg2.connect(database="postgres", host="localhost", user="postgres", password="542623", port="5432") as conn:
    sql_svcg_2015 = "SELECT * FROM sample_svcg_2015;"
    df_svcg_2015 = pd.read_sql(sql_svcg_2015, conn, parse_dates=['monthly_reporting_period'])
    sql_svcg_2020 = "SELECT * FROM sample_svcg_2020;"
    df_svcg_2020 = pd.read_sql(sql_svcg_2020, conn, parse_dates=['monthly_reporting_period'])
# Origination
with psycopg2.connect(database="postgres", host="localhost", user="postgres", password="542623", port="5432") as conn:
    sql_orig_2015 = "SELECT * FROM sample_orig_2015;"
    df_orig_2015 = pd.read_sql(sql_orig_2015, conn, parse_dates=['first_payment_date'])
    sql_orig_2020 = "SELECT * FROM sample_orig_2020;"
    df_orig_2020 = pd.read_sql(sql_orig_2020, conn, parse_dates=['first_payment_date'])
Feature Engineering¶
Calculating D60, SATO, ZATO, IDP¶
df_orig_2015['30yrFRM'] = np.where(df_orig_2015['original_loan_term'] / 12 >= 25, True, False)
df_orig_2020['30yrFRM'] = np.where(df_orig_2020['original_loan_term'] / 12 >= 25, True, False)
df_interest_rate['Month'] = df_interest_rate['Week'].dt.month
df_interest_rate['Year'] = df_interest_rate['Week'].dt.year
monthly_avg_rate = df_interest_rate.groupby(['Year', 'Month']).mean().reset_index()
monthly_avg_rate['Date'] = monthly_avg_rate['Year'].astype(str) + '-' + monthly_avg_rate['Month'].astype(str) + '-' + '01'
monthly_avg_rate['Date'] = pd.to_datetime(monthly_avg_rate['Date'])
monthly_avg_rate.drop(['Year', 'Month', 'Week'], axis=1, inplace=True)
df_orig_2015 = hp.process_yearly_data(df_orig_2015, monthly_avg_rate)
df_orig_2020 = hp.process_yearly_data(df_orig_2020, monthly_avg_rate)
df_orig_2015 = hp.process_loan_data(df_orig_2015, df_svcg_2015)
df_orig_2020 = hp.process_loan_data(df_orig_2020, df_svcg_2020)
data_2015_counts = df_orig_2015['ever_D60_3years_flag'].value_counts()
data_2020_counts = df_orig_2020['ever_D60_3years_flag'].value_counts()
data_2015_percentages = data_2015_counts / data_2015_counts.sum() * 100
data_2020_percentages = data_2020_counts / data_2020_counts.sum() * 100
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
axes[0].pie(data_2015_percentages,
            labels=data_2015_percentages.index,
            autopct='%1.1f%%',
            )
axes[0].set_title('2015 Target Variable Distribution', fontsize=14)
axes[1].pie(data_2020_percentages,
            labels=data_2020_percentages.index,
            autopct='%1.1f%%',
            )
axes[1].set_title('2020 Target Variable Distribution', fontsize=14)
plt.tight_layout()
plt.show()
Histogram¶
features = [
    {'column': 'interest_diff_percentage', 'title': 'Interest Rate Difference Histogram', 'xlabel': 'Interest Rate Difference Percentage'},
    {'column': 'sato_f30', 'title': 'SATO Histogram', 'xlabel': 'SATO'},
    {'column': 'zato', 'title': 'ZATO Histogram', 'xlabel': 'ZATO'}
]
for feature in features:
    plt.hist(df_orig_2015[feature['column']], bins=20, color='blue', alpha=0.5, label='2015')
    plt.hist(df_orig_2020[feature['column']], bins=20, color='green', alpha=0.5, label='2020')
    plt.title(feature['title'])
    plt.xlabel(feature['xlabel'])
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()
Binning¶
df_orig_2015 = hp.merge_data(df_orig_2015, df_hpi=df_hpi, df_unemp_state=df_unemp_state, df_unemp_national=df_unemp_national, df_cpi=df_cpi)
df_orig_2020 = hp.merge_data(df_orig_2020, df_hpi=df_hpi, df_unemp_state=df_unemp_state, df_unemp_national=df_unemp_national, df_cpi=df_cpi)
df_orig_2015_reg_bins = df_orig_2015.copy()
df_orig_2020_reg_bins = df_orig_2020.copy()
# convert credit score, dti, ltv into categorical bins
# df_orig_2015_reg_bins = hp.bin_columns(df_orig_2015_reg_bins, hp.binning_config)
# df_orig_2020_reg_bins = hp.bin_columns(df_orig_2020_reg_bins, hp.binning_config)
Model¶
test_size = 0.25
X = df_orig_2015_reg_bins[hp.categorical_features_ml + hp.numerical_features_ml]
y = df_orig_2015_reg_bins["ever_D60_3years_flag"]
X_val = df_orig_2020_reg_bins[hp.categorical_features_ml + hp.numerical_features_ml]
y_val = df_orig_2020_reg_bins["ever_D60_3years_flag"]
if test_size == 1:
    X_train, X_test, y_train, y_test = X, X, y, y
else:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=hp.random_state, stratify=y)
X_train_preprocessed, X_test_preprocessed, X_val_preprocessed, feature_names = hp.preprocess_data(X_train, X_test, X_val, hp.numerical_features_ml, hp.categorical_features_ml)
# smote = SMOTE(random_state=42, k_neighbors=5)
# X_train_preprocessed, y_train = smote.fit_resample(X_train_preprocessed, y_train)
Logistic Regression¶
logistics = m.ml_model(model_name="logistics")
best_params = logistics.tune(X_train_preprocessed, y_train, X_val_preprocessed, y_val, n_trials=50)
print("=== Logistic Regression ===")
logistics.train(X_train_preprocessed, y_train)
print(f"ROC AUC train: {roc_auc_score(y_train, logistics.model.predict_proba(X_train_preprocessed)[:, 1])}")
print(f"ROC AUC test: {roc_auc_score(y_test, logistics.model.predict_proba(X_test_preprocessed)[:, 1])}")
print(f"ROC AUC val: {roc_auc_score(y_val, logistics.model.predict_proba(X_val_preprocessed)[:, 1])}")
hp.plot_optimization_history(logistics.study)
Best parameters: {'penalty': 'l2', 'C': 0.00031156658875241427, 'solver': 'newton-cholesky'}
=== Logistic Regression ===
ROC AUC train: 0.7953032144441419
ROC AUC test: 0.8043236870542857
ROC AUC val: 0.7201244182764975
Random Forest¶
rf = m.ml_model(model_name="rf")
best_params = rf.tune(X_train_preprocessed, y_train, X_val_preprocessed, y_val, n_trials=100)
print("=== Random Forest ===")
rf.train(X_train_preprocessed, y_train)
print(f"ROC AUC train: {roc_auc_score(y_train, rf.model.predict_proba(X_train_preprocessed)[:, 1])}")
print(f"ROC AUC test: {roc_auc_score(y_test, rf.model.predict_proba(X_test_preprocessed)[:, 1])}")
print(f"ROC AUC val: {roc_auc_score(y_val, rf.model.predict_proba(X_val_preprocessed)[:, 1])}")
hp.plot_optimization_history(rf.study)
Best parameters: {'n_estimators': 471, 'max_depth': 16, 'min_samples_split': 10, 'min_samples_leaf': 48, 'max_leaf_nodes': 69, 'max_samples': 0.8663158031758996, 'max_features': 0.21694370674765878}
=== Random Forest ===
ROC AUC train: 0.9022235572225017
ROC AUC test: 0.8160807097871756
ROC AUC val: 0.7352164678905424
XGBoost¶
xgboost = m.ml_model(model_name="xgbt")
best_params = xgboost.tune(X_train_preprocessed, y_train, X_val_preprocessed, y_val, n_trials=100)
print("=== XGBoost ===")
xgboost.train(X_train_preprocessed, y_train)
print(f"ROC AUC train: {roc_auc_score(y_train, xgboost.model.predict_proba(X_train_preprocessed)[:, 1])}")
print(f"ROC AUC test: {roc_auc_score(y_test, xgboost.model.predict_proba(X_test_preprocessed)[:, 1])}")
print(f"ROC AUC val: {roc_auc_score(y_val, xgboost.model.predict_proba(X_val_preprocessed)[:, 1])}")
hp.plot_optimization_history(xgboost.study)
Best parameters: {'eta': 0.01615951221706834, 'gamma': 0.21564948239941945, 'n_estimators': 384, 'subsample': 0.7482412107697541, 'sampling_method': 'uniform', 'colsample_bytree': 0.7495123114197679, 'colsample_bylevel': 0.6489886876116951, 'colsample_bynode': 0.6620656835727575, 'max_depth': 7, 'min_child_weight': 28, 'lambda': 0.007797526601282081, 'alpha': 0.03965404191946481, 'tree_method': 'hist', 'grow_policy': 'depthwise'}
=== XGBoost ===
ROC AUC train: 0.8539426006005897
ROC AUC test: 0.8192558455873731
ROC AUC val: 0.7377779321322725
Feature Importance¶
Random Forests¶
importance_df_rf = hp.compute_group_permutation_importance(
    rf.train(X_train_preprocessed, y_train).model,
    X_train,
    X_test,
    X_val_preprocessed,
    X_val,
    y_val,
    columns=X_val.columns,
    n=5
)
hp.plot_feature_importance(importance_df_rf, rf)
hp.plot_tree_feature_importance(rf, feature_names)
XGBoost¶
importance_df_xgboost = hp.compute_group_permutation_importance(
    xgboost.train(X_train_preprocessed, y_train).model,
    X_train,
    X_test,
    X_val_preprocessed,
    X_val,
    y_val,
    columns=X_val.columns,
    n=10
)
hp.plot_feature_importance(importance_df_xgboost, xgboost)
hp.plot_tree_feature_importance(xgboost, feature_names)
Logistic Regression¶
importance_df_logistics = hp.compute_group_permutation_importance(
    logistics.train(X_train_preprocessed, y_train).model,
    X_train,
    X_test,
    X_val_preprocessed,
    X_val,
    y_val,
    columns=X_val.columns,
    n=5
)
hp.plot_feature_importance(importance_df_logistics, logistics)
Overall Comparison¶
hp.plot_model_feature_importance_comparison(
    model_importance_dfs=[importance_df_logistics, importance_df_rf, importance_df_xgboost],
    model_names=['Logistics', 'Random Forest', 'XGBoost']
)
Performance¶
Out-of-sample performance, using the 2020 validation data.
prob_thresholds = 0.04
Random Forests¶
y_pred_prob = rf.model.predict_proba(X_val_preprocessed)[:, 1]
y_pred = np.where(y_pred_prob > prob_thresholds, 1, 0)
hp.plot_density(y_pred_prob, y_pred, y_val)
XGBoost¶
y_pred_prob = xgboost.model.predict_proba(X_val_preprocessed)[:, 1]
y_pred = np.where(y_pred_prob > prob_thresholds, 1, 0)
hp.plot_density(y_pred_prob, y_pred, y_val)
Logistic Regression¶
y_pred_prob = logistics.model.predict_proba(X_val_preprocessed)[:, 1]
y_pred = np.where(y_pred_prob > prob_thresholds, 1, 0)
hp.plot_density(y_pred_prob, y_pred, y_val)